13. Summary

Figure: state-value function for the golf-playing agent (Sutton and Barto, 2017)

### Policies

  • A deterministic policy is a mapping \pi: \mathcal{S}\to\mathcal{A} . For each state s\in\mathcal{S} , it yields the action a\in\mathcal{A} that the agent will choose while in state s .
  • A stochastic policy is a mapping \pi: \mathcal{S}\times\mathcal{A}\to [0,1] . For each state s\in\mathcal{S} and action a\in\mathcal{A} , it yields the probability \pi(a|s) that the agent chooses action a while in state s .
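As a concrete illustration of these two definitions, here is a minimal Python sketch. The three states, the two action names, and the `act` helper are all hypothetical; they only serve to show how a deterministic policy (one action per state) differs from a stochastic one (a distribution over actions per state).

```python
import random

# Hypothetical 3-state, 2-action MDP, used only for illustration.
states = ["s0", "s1", "s2"]
actions = ["left", "right"]

# Deterministic policy: one action per state (pi: S -> A).
deterministic_pi = {"s0": "right", "s1": "left", "s2": "right"}

# Stochastic policy: a probability distribution over actions per state (pi(a|s)).
stochastic_pi = {
    "s0": {"left": 0.3, "right": 0.7},
    "s1": {"left": 0.9, "right": 0.1},
    "s2": {"left": 0.5, "right": 0.5},
}

def act(policy, state):
    """Return the action the policy chooses in `state`."""
    choice = policy[state]
    if isinstance(choice, dict):  # stochastic: sample an action from pi(.|s)
        acts, probs = zip(*choice.items())
        return random.choices(acts, weights=probs, k=1)[0]
    return choice                 # deterministic: look up pi(s)

print(act(deterministic_pi, "s0"))  # always 'right'
print(act(stochastic_pi, "s0"))     # 'right' about 70% of the time
```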

### State-Value Functions

  • The state-value function for a policy \pi is denoted v_\pi . For each state s \in\mathcal{S} , it yields the expected return if the agent starts in state s and then uses the policy to choose its actions for all time steps. That is, v_\pi(s) \doteq \mathbb{E}_\pi[G_t|S_t=s] . We refer to v_\pi(s) as the value of state s under policy \pi .
  • The notation \mathbb{E}_\pi[\cdot] is borrowed from the suggested textbook, where \mathbb{E}_\pi[\cdot] is defined as the expected value of a random variable, given that the agent follows policy \pi .
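To make the expectation \mathbb{E}_\pi[G_t|S_t=s] concrete, the sketch below computes the discounted return of one sampled episode and averages many such returns to form a Monte Carlo estimate of v_\pi(s) . The toy sample_episode environment (reward 1 per step, termination probability 0.2) and the discount rate are made up for illustration.

```python
import random

def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + ..., computed backwards over one episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def sample_episode(state):
    """Hypothetical environment: reward 1 per step, episode ends with probability 0.2."""
    rewards = []
    while True:
        rewards.append(1.0)
        if random.random() < 0.2:
            return rewards

def mc_state_value(state, gamma=0.9, n_episodes=10_000):
    """Monte Carlo estimate of v_pi(state) = E_pi[G_t | S_t = state]."""
    returns = [discounted_return(sample_episode(state), gamma) for _ in range(n_episodes)]
    return sum(returns) / len(returns)

print(mc_state_value("s0"))  # roughly 1 / (1 - 0.9 * 0.8) ~= 3.57
```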

### Bellman Equations

  • The Bellman expectation equation for v_\pi is: v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1})|S_t = s].
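The Bellman expectation equation also suggests an algorithm: repeatedly replace v(s) with the right-hand side until the values stop changing (iterative policy evaluation). Below is a minimal sketch for a hypothetical two-state MDP with a made-up transition model and an equiprobable policy; none of these numbers come from the lesson.

```python
# Hypothetical 2-state, 2-action MDP; model[s][a] is a list of
# (probability, next_state, reward) triples, i.e. p(s', r | s, a).
gamma = 0.9
states, actions = [0, 1], [0, 1]
pi = {s: {a: 0.5 for a in actions} for s in states}   # equiprobable policy pi(a|s)
model = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

v = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        # Bellman expectation update:
        # v(s) <- sum_a pi(a|s) * sum_{s',r} p(s',r|s,a) * [r + gamma * v(s')]
        new_v = sum(
            pi[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in model[s][a])
            for a in actions
        )
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-10:
        break

print({s: round(val, 3) for s, val in v.items()})  # {0: 7.25, 1: 7.75}
```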

### Optimality

  • A policy \pi' is defined to be better than or equal to a policy \pi if and only if v_{\pi'}(s) \geq v_\pi(s) for all s\in\mathcal{S} .
  • An optimal policy \pi_* satisfies \pi_* \geq \pi for all policies \pi . An optimal policy is guaranteed to exist but may not be unique.
  • All optimal policies have the same state-value function v_* , called the optimal state-value function .
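The "better than or equal to" relation is just a state-by-state comparison of value functions, as in this small sketch (the value functions here are hypothetical dicts from states to values):

```python
def at_least_as_good(v_pi_prime, v_pi):
    """True iff v_pi'(s) >= v_pi(s) for every state s, i.e. pi' >= pi."""
    return all(v_pi_prime[s] >= v_pi[s] for s in v_pi)

print(at_least_as_good({"s0": 2.0, "s1": 1.5}, {"s0": 1.0, "s1": 1.5}))  # True
print(at_least_as_good({"s0": 2.0, "s1": 1.0}, {"s0": 1.0, "s1": 1.5}))  # False
```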

### Action-Value Functions

  • The action-value function for a policy \pi is denoted q_\pi . For each state s \in\mathcal{S} and action a \in\mathcal{A} , it yields the expected return if the agent starts in state s , takes action a , and then follows the policy for all future time steps. That is, q_\pi(s,a) \doteq \mathbb{E}_\pi[G_t|S_t=s, A_t=a] . We refer to q_\pi(s,a) as the value of taking action a in state s under a policy \pi (or alternatively as the value of the state-action pair s, a ).
  • All optimal policies have the same action-value function q_* , called the optimal action-value function .
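When a one-step model p(s',r|s,a) is available, q_\pi can be computed from v_\pi via the standard identity q_\pi(s,a) = \sum_{s',r} p(s',r|s,a)[r + \gamma v_\pi(s')] (this identity comes from the same textbook; it is not stated in the bullets above). A minimal sketch, reusing the hypothetical two-state model and the values produced by the policy-evaluation sketch:

```python
# Hypothetical one-step model p(s', r | s, a) as (probability, next_state, reward)
# triples, plus the v_pi values from the policy-evaluation sketch above.
gamma = 0.9
model = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
v_pi = {0: 7.25, 1: 7.75}

def q_from_v(s, a):
    """q_pi(s,a) = sum_{s',r} p(s',r|s,a) * [r + gamma * v_pi(s')]."""
    return sum(p * (r + gamma * v_pi[s2]) for p, s2, r in model[s][a])

print({(s, a): round(q_from_v(s, a), 3) for s in model for a in model[s]})
```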

### Optimal Policies

  • Once the agent determines the optimal action-value function q_* , it can quickly obtain an optimal policy \pi_* by setting \pi_*(s) = \arg\max_{a\in\mathcal{A}(s)} q_*(s,a) .
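A minimal sketch of this greedy extraction, assuming q_* is stored as a dict keyed by (state, action) pairs with made-up values:

```python
# Hypothetical optimal action-value function q_*.
q_star = {
    ("s0", "left"): 1.2, ("s0", "right"): 3.4,
    ("s1", "left"): 2.0, ("s1", "right"): 0.5,
}

states = sorted({s for s, _ in q_star})
actions = sorted({a for _, a in q_star})

# pi_*(s) = argmax_a q_*(s, a): act greedily with respect to q_*.
pi_star = {s: max(actions, key=lambda a: q_star[(s, a)]) for s in states}
print(pi_star)  # {'s0': 'right', 's1': 'left'}
```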